ExtraaLearn Project¶
Context¶
The EdTech industry has surged immensely over the past decade; according to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has driven this growth and expansion. With dominant features like ease of information sharing, a personalized learning experience, and transparency of assessment, it is now often preferred over traditional education.
In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. This rapid growth has drawn many new companies into the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, such as:
- The customer interacts with the marketing front on social media or other online platforms.
- The customer browses the website/app and downloads the brochure
- The customer connects through emails for more information.
The company then nurtures these leads and tries to convert them to paid customers. For this, the representative from the organization connects with the lead on call or through email to share further details.
Objective¶
ExtraaLearn is an initial stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is to identify which of the leads are more likely to convert so that they can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
- Analyze and build an ML model to help identify which leads are more likely to convert to paid customers,
- Find the factors driving the lead conversion process
- Create a profile of the leads which are likely to convert
Data Description¶
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
Data Dictionary
ID: ID of the lead
age: Age of the lead
current_occupation: Current occupation of the lead. Values include 'Professional','Unemployed',and 'Student'
first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website', 'Mobile App'
profile_completed: What percentage of the profile the lead has filled in on the website/mobile app. Values include Low (0-50%), Medium (50-75%), High (75-100%)
website_visits: How many times has a lead visited the website
time_spent_on_website: Total time spent on the website
page_views_per_visit: Average number of pages on the website viewed during the visits.
last_activity: Last interaction between the lead and ExtraaLearn.
- Email Activity: Sought details about the program through email; a representative shared information with the lead, such as the program brochure, etc.
- Phone Activity: Had a phone conversation or an SMS conversation with a representative, etc.
- Website Activity: Interacted on live chat with a representative, updated their profile on the website, etc.
print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.
print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine.
digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms.
educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.
referral: Flag indicating whether the lead had heard about ExtraaLearn through reference.
status: Flag indicating whether the lead was converted to a paid customer or not.
Importing necessary libraries and data¶
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
import sklearn.metrics as metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
classification_report,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer
)
# Mount Gdrive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Read the dataset
data = pd.read_csv("/content/drive/MyDrive/Python Course/ExtraaLearn.csv")
Data Overview¶
- Observations
- Sanity checks
data.shape
(4612, 15)
data.head()
| ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.86100 | Website Activity | Yes | No | Yes | No | No | 1 |
| 1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.32000 | Website Activity | No | No | No | Yes | No | 0 |
| 2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.07400 | Website Activity | No | No | Yes | No | No | 0 |
| 3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.05700 | Website Activity | No | No | No | No | No | 1 |
| 4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.91400 | Email Activity | No | No | No | No | No | 0 |
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4612 entries, 0 to 4611 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 4612 non-null object 1 age 4612 non-null int64 2 current_occupation 4612 non-null object 3 first_interaction 4612 non-null object 4 profile_completed 4612 non-null object 5 website_visits 4612 non-null int64 6 time_spent_on_website 4612 non-null int64 7 page_views_per_visit 4612 non-null float64 8 last_activity 4612 non-null object 9 print_media_type1 4612 non-null object 10 print_media_type2 4612 non-null object 11 digital_media 4612 non-null object 12 educational_channels 4612 non-null object 13 referral 4612 non-null object 14 status 4612 non-null int64 dtypes: float64(1), int64(4), object(10) memory usage: 540.6+ KB
data.isna().sum()
| 0 | |
|---|---|
| ID | 0 |
| age | 0 |
| current_occupation | 0 |
| first_interaction | 0 |
| profile_completed | 0 |
| website_visits | 0 |
| time_spent_on_website | 0 |
| page_views_per_visit | 0 |
| last_activity | 0 |
| print_media_type1 | 0 |
| print_media_type2 | 0 |
| digital_media | 0 |
| educational_channels | 0 |
| referral | 0 |
| status | 0 |
No null values in the dataset.
data.duplicated().sum()
np.int64(0)
No duplicates.
data['ID'].nunique()
4612
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 4612.00000 | 46.20121 | 13.16145 | 18.00000 | 36.00000 | 51.00000 | 57.00000 | 63.00000 |
| website_visits | 4612.00000 | 3.56678 | 2.82913 | 0.00000 | 2.00000 | 3.00000 | 5.00000 | 30.00000 |
| time_spent_on_website | 4612.00000 | 724.01127 | 743.82868 | 0.00000 | 148.75000 | 376.00000 | 1336.75000 | 2537.00000 |
| page_views_per_visit | 4612.00000 | 3.02613 | 1.96812 | 0.00000 | 2.07775 | 2.79200 | 3.75625 | 18.43400 |
| status | 4612.00000 | 0.29857 | 0.45768 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 1.00000 |
Data Preprocessing¶
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
data_clean = data.copy()
# Drop unnecessary columns
data_clean.drop(['ID'], axis=1, inplace=True)
reusable_map = {'Low': 1, 'Medium': 2, 'High': 3}
data_clean['profile_completed'] = data_clean['profile_completed'].map(reusable_map)
# Change column names
data_clean = data_clean.rename(columns={'print_media_type1': 'newspaper', 'print_media_type2': 'magazine'})
# Define channel columns
channel_columns = ['newspaper', 'magazine', 'digital_media', 'educational_channels', 'referral']
# Define categorical columns
categorical_cols = data_clean.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_cols)
['current_occupation', 'first_interaction', 'last_activity', 'newspaper', 'magazine', 'digital_media', 'educational_channels', 'referral']
# Define numerical columns
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
# Outlier detection and treatment (IQR capping)
for col in numeric_cols:
    Q1 = data_clean[col].quantile(0.25)
    Q3 = data_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap values outside the IQR fences (equivalent to the nested np.where approach)
    data_clean[col] = data_clean[col].clip(lower_bound, upper_bound)
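Before capping, it can be useful to know how many observations actually fall outside the IQR fences, since heavy capping distorts a distribution. A minimal sketch on an illustrative series (not the actual leads data):

```python
import pandas as pd

# Illustrative series with one clear outlier (not the actual leads data)
s = pd.Series([1, 2, 2, 3, 3, 4, 5, 30])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Count values that would be capped
n_out = ((s < lower) | (s > upper)).sum()
print(f"{n_out} value(s) outside [{lower}, {upper}]")
```

Running the same check per column on `data_clean` before the capping loop would show how aggressive the treatment is for each feature.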
Let's create a feature to measure the engagement status for each customer.
# Assign weights (example: visits=0.3, time=0.5, pages=0.2)
weights = {'website_visits': 0.3, 'time_spent_on_website': 0.5, 'page_views_per_visit': 0.2}
data_clean['engagement_score'] = (
data_clean['website_visits'] * weights['website_visits'] +
data_clean['time_spent_on_website'] * weights['time_spent_on_website'] +
data_clean['page_views_per_visit'] * weights['page_views_per_visit']
)
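One caveat with the weighted sum above: the three components live on very different scales, so `time_spent_on_website` (hundreds to thousands) dominates the score regardless of the weights. A hedged alternative sketch, min-max scaling each component to [0, 1] first (shown on a small hypothetical frame with the same column names):

```python
import pandas as pd

# Hypothetical example frame with the same column names as data_clean
df = pd.DataFrame({
    "website_visits": [2, 5, 10],
    "time_spent_on_website": [100, 800, 2000],
    "page_views_per_visit": [1.5, 3.0, 6.0],
})

weights = {"website_visits": 0.3, "time_spent_on_website": 0.5, "page_views_per_visit": 0.2}

# Min-max scale each component to [0, 1] so no single feature dominates the sum
normed = (df - df.min()) / (df.max() - df.min())
df["engagement_score"] = sum(normed[c] * w for c, w in weights.items())

print(df["engagement_score"])  # scores now bounded in [0, 1]
```

Since the features are standardized later in the pipeline anyway, the raw weighted sum still works for modeling; the normalized variant mainly makes the score itself interpretable.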
data_clean.head()
| age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | newspaper | magazine | digital_media | educational_channels | referral | status | engagement_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57.00000 | Unemployed | Website | 3 | 7.00000 | 1639.00000 | 1.86100 | Website Activity | Yes | No | Yes | No | No | 1 | 821.97220 |
| 1 | 56.00000 | Professional | Mobile App | 2 | 2.00000 | 83.00000 | 0.32000 | Website Activity | No | No | No | Yes | No | 0 | 42.16400 |
| 2 | 52.00000 | Professional | Website | 2 | 3.00000 | 330.00000 | 0.07400 | Website Activity | No | No | Yes | No | No | 0 | 165.91480 |
| 3 | 53.00000 | Unemployed | Website | 3 | 4.00000 | 464.00000 | 2.05700 | Website Activity | No | No | No | No | No | 1 | 233.61140 |
| 4 | 23.00000 | Student | Website | 3 | 4.00000 | 600.00000 | 6.27400 | Email Activity | No | No | No | No | No | 0 | 302.45480 |
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions
- Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
- The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
- The company uses multiple modes to interact with prospects. Which way of interaction works best?
- The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
- People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?
EDA¶
- It is a good idea to explore the data once again after manipulating it.
Criteria
Exploratory Data Analysis
- Problem definition
- Univariate analysis
- Bivariate analysis
- Provide comments on the visualization such as range of attributes, outliers of various attributes.
- Provide comments on the distribution of the variables
- Use appropriate visualizations to identify the patterns and insights
- Key meaningful observations on individual variables and the relationship between variables
Univariate analysis¶
# Function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # X-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},  # Creating the 2 subplots
        figsize=figsize,
    )
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # Boxplot; a star indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    )  # Histogram (bins=None lets seaborn choose)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.tight_layout()  # Auto-adjusts the positions of the plots
    plt.show()
    print(feature, "Skewness: %f" % data[feature].skew())  # Use the passed dataframe, not a global
    print("*" * 20)
Numerical columns analysis¶
for col in numeric_cols:
    histogram_boxplot(data_clean, col, kde=True)
age Skewness: -0.720022 ********************
website_visits Skewness: 0.891856 ********************
time_spent_on_website Skewness: 0.952928 ********************
page_views_per_visit Skewness: 0.204383 ********************
Observations on age: Skewed to the left, with a mean age of 46 years.
Observations on website_visits: Skewed to the right; the median lead makes 3 visits and 75% make 5 or fewer, with a long tail up to 30.
Observations on time_spent_on_website: Skewed to the right, with a mean of 724 but a high standard deviation of about 744.
Observations on page_views_per_visit: Multi-modal distribution, with most customers viewing between 2 and 3 pages per visit and some spikes around 0 and 6.
Categorical columns analysis¶
# Graph style setup
sns.set(style="whitegrid")
# Graph all the categorical features
for col in categorical_cols:
    # Absolute count and percentage
    value_counts = data_clean[col].value_counts()
    percentage = (value_counts / len(data_clean)) * 100
    # Display table of count and percentage
    print(f"\n--- {col} ---")
    print(pd.DataFrame({'Count': value_counts, '(%)': percentage.round(2)}))
    # Barplot
    plt.figure(figsize=(8, 5))
    sns.barplot(x=value_counts.index, y=value_counts.values, palette="viridis")
    plt.title(f"Distribution of {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel("Frequency", fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
--- current_occupation ---
Count (%)
current_occupation
Professional 2616 56.72000
Unemployed 1441 31.24000
Student 555 12.03000
--- first_interaction ---
Count (%)
first_interaction
Website 2542 55.12000
Mobile App 2070 44.88000
--- last_activity ---
Count (%)
last_activity
Email Activity 2278 49.39000
Phone Activity 1234 26.76000
Website Activity 1100 23.85000
--- newspaper ---
Count (%)
newspaper
No 4115 89.22000
Yes 497 10.78000
--- magazine ---
Count (%)
magazine
No 4379 94.95000
Yes 233 5.05000
--- digital_media ---
Count (%)
digital_media
No 4085 88.57000
Yes 527 11.43000
--- educational_channels ---
Count (%)
educational_channels
No 3907 84.71000
Yes 705 15.29000
--- referral ---
Count (%)
referral
No 4519 97.98000
Yes 93 2.02000
Observations on current_occupation: Most of our leads are Professionals, followed by Unemployed and Students.
Observations on first_interaction: 55% of first interactions came from the website, 45% from the mobile app.
Observations on last_activity: Email is the most common last activity (about half of the leads); phone and website activity are roughly even.
Observations on profile_completed: Very good numbers for Medium and High profile completion; only about 2% of profiles have Low completion.
Observations on newspaper: About 11% of the leads saw the newspaper ad.
Observations on magazine (print_media_type2): About 5% of the leads saw the magazine ad.
Observations on digital_media: 11.4% of the leads saw the ad on digital platforms.
Observations on educational_channels: 15.3% of the leads heard about ExtraaLearn through educational channels.
Observations on referral: Only 2% of leads come from referrals.
Observations from Univariate Analysis: The distribution of variables shows a strong reliance on digital and educational channels, with high engagement through email and web platforms. Most leads have well-completed profiles, indicating good data quality.
# Calculate conversion % per occupation
conversion = data_clean.groupby('current_occupation')['status'].mean().reset_index()
conversion['conversion_%'] = conversion['status'] * 100
print(conversion)
# Barplot
plt.figure(figsize=(8,5))
sns.barplot(x='current_occupation', y='conversion_%', data=conversion, palette='viridis')
plt.title('Conversion rate per occupation')
plt.ylabel('Conversion rate (%)')
plt.xlabel('Current Occupation')
plt.ylim(0, 100)
plt.show()
| | current_occupation | status | conversion_% |
|---|---|---|---|
| 0 | Professional | 0.35512 | 35.51223 |
| 1 | Student | 0.11712 | 11.71171 |
| 2 | Unemployed | 0.26579 | 26.57876 |
Professionals convert at the highest rate (35.5%), followed by the Unemployed (26.6%); Students convert the least (11.7%).
Bivariate analysis¶
# Correlation check
plt.figure(figsize=(10, 7))
sns.heatmap(
data_clean[numeric_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show();
No strong correlation among the numeric columns; page_views_per_visit is only weakly related to website_visits.
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    # The 'Yes' conversion rate only applies to the binary flag columns
    if 'Yes' in data[predictor].unique():
        conversion = data[data[predictor] == 'Yes'][target].mean() * 100
        print(f"% of conversion for {predictor} = 'Yes': {conversion:.2f}%")
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
Leads will have different expectations from the outcome of the course, and their current occupation may play a key role in whether they take the program. Let's analyze it.
stacked_barplot(data=data_clean, predictor="current_occupation", target="status")
status 0 1 All current_occupation All 3235 1377 4612 Professional 1687 929 2616 Unemployed 1058 383 1441 Student 490 65 555 ------------------------------------------------------------------------------------------------------------------------
The highest amount of converted leads comes from customers at a Professional level.
Age can be a good factor to differentiate between such leads
plt.figure(figsize=(10, 5))
sns.boxplot(data = data_clean, x = "current_occupation", y = "age")
plt.show()
data_clean.groupby(["current_occupation"])["age"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| current_occupation | ||||||||
| Professional | 2616.00000 | 49.34748 | 9.89074 | 25.00000 | 42.00000 | 54.00000 | 57.00000 | 60.00000 |
| Student | 555.00000 | 21.14414 | 2.00111 | 18.00000 | 19.00000 | 21.00000 | 23.00000 | 25.00000 |
| Unemployed | 1441.00000 | 50.14018 | 9.99950 | 32.00000 | 42.00000 | 54.00000 | 58.00000 | 63.00000 |
Students are our youngest group. Unemployed and Professional have a higher variance with a mean age around 50 years.
The company's first interaction with leads should be compelling and persuasive. Let's see if the channels of the first interaction have an impact on the conversion of leads
stacked_barplot(data=data_clean, predictor="first_interaction", target="status")
status 0 1 All first_interaction All 3235 1377 4612 Website 1383 1159 2542 Mobile App 1852 218 2070 ------------------------------------------------------------------------------------------------------------------------
Website leads convert far better (about 46%) than Mobile App leads (about 11%). The Mobile App funnel needs improvement: increase its appeal or offer signup discounts.
# checking the median value
data_clean.groupby(["status"])["time_spent_on_website"].median()
| time_spent_on_website | |
|---|---|
| status | |
| 0 | 317.00000 |
| 1 | 789.00000 |
Leads that do convert spend more than double the amount of time browsing the website.
Recall that we created an engagement_score feature earlier from website visits, time spent on the website, and page views per visit.
People browsing the website or the mobile app are generally required to create a profile by sharing their personal details before they can access more information. Let's see if the profile completion level has an impact on lead status
stacked_barplot(data=data_clean, predictor='profile_completed', target='status')
status 0 1 All profile_completed All 3235 1377 4612 3 1318 946 2264 2 1818 423 2241 1 99 8 107 ------------------------------------------------------------------------------------------------------------------------
A higher profile completion is positively correlated to a higher conversion rate.
After a lead shares their information by creating a profile, there may be interactions between the lead and the company to proceed with the process of enrollment. Let's see how the last activity impacts lead conversion status
stacked_barplot(data=data_clean, predictor='last_activity', target='status')
status 0 1 All last_activity All 3235 1377 4612 Email Activity 1587 691 2278 Website Activity 677 423 1100 Phone Activity 971 263 1234 ------------------------------------------------------------------------------------------------------------------------
Leads whose last activity was website activity convert at the highest rate (about 38%), although email activity accounts for the most conversions in absolute terms.
Let's see how advertisement and referrals impact the lead status
stacked_barplot(data=data_clean, predictor='newspaper', target='status')
% of conversion for newspaper = 'Yes': 31.99% status 0 1 All newspaper All 3235 1377 4612 No 2897 1218 4115 Yes 338 159 497 ------------------------------------------------------------------------------------------------------------------------
Out of 497 leads coming from Newspaper, 32% of leads converted to sales.
stacked_barplot(data=data_clean, predictor='magazine', target='status')
% of conversion for magazine = 'Yes': 32.19% status 0 1 All magazine All 3235 1377 4612 No 3077 1302 4379 Yes 158 75 233 ------------------------------------------------------------------------------------------------------------------------
75/233 leads converted coming from magazines. 32% conversion rate.
stacked_barplot(data=data_clean, predictor='digital_media', target='status')
% of conversion for digital_media = 'Yes': 31.88% status 0 1 All digital_media All 3235 1377 4612 No 2876 1209 4085 Yes 359 168 527 ------------------------------------------------------------------------------------------------------------------------
168/527 leads converted coming from Digital Media. Again 32% conversion rate.
stacked_barplot(data=data_clean, predictor='educational_channels', target='status')
% of conversion for educational_channels = 'Yes': 27.94% status 0 1 All educational_channels All 3235 1377 4612 No 2727 1180 3907 Yes 508 197 705 ------------------------------------------------------------------------------------------------------------------------
197/705 leads converted coming from Educational Channels. Not a great conversion rate at almost 28%.
stacked_barplot(data=data_clean, predictor='referral', target='status')
% of conversion for referral = 'Yes': 67.74% status 0 1 All referral All 3235 1377 4612 No 3205 1314 4519 Yes 30 63 93 ------------------------------------------------------------------------------------------------------------------------
Very good conversion rate for referrals at 67.7%
So overall among paid media there's a 32% lead conversion rate or less (1 in 3).
Referrals are the leads most likely to convert to paid customers (2 out of 3).
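The per-channel conversion rates quoted above can be computed in one pass over the flag columns. A minimal sketch on a toy frame with the same Yes/No layout as data_clean (illustrative values only):

```python
import pandas as pd

# Toy frame with the same Yes/No flag layout as data_clean (illustrative values)
df = pd.DataFrame({
    "newspaper": ["Yes", "No", "Yes", "No"],
    "referral":  ["No", "Yes", "No", "No"],
    "status":    [1, 0, 0, 1],
})

# Mean of the binary status among 'Yes' rows = conversion rate for that channel
rates = {
    col: df.loc[df[col] == "Yes", "status"].mean() * 100
    for col in ["newspaper", "referral"]
}
print(pd.Series(rates).round(2))
```

Applying the same dictionary comprehension to `data_clean` over `channel_columns` would reproduce the 32% / 32% / 32% / 28% / 68% figures in one table.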
Let's compare the preferred channels.
# Sum 'Yes' for each channel
channel_counts = {}
for col in channel_columns:
    count_yes = (data_clean[col] == 'Yes').sum()
    percentage_yes = (count_yes / len(data_clean)) * 100
    channel_counts[col] = {'Count_yes': count_yes, '(%)': round(percentage_yes, 2)}
# Convert to DataFrame
channel_df = pd.DataFrame(channel_counts).T.reset_index()
channel_df.rename(columns={'index': 'Channel'}, inplace=True)
# Visualization
plt.figure(figsize=(8, 5))
sns.barplot(x='Channel', y='Count_yes', data=channel_df, palette='mako')
plt.title('Best channels', fontsize=14)
plt.xlabel('Channel', fontsize=12)
plt.ylabel('# of Leads', fontsize=12)
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
# Display table
print(channel_df)
Channel Count_yes (%) 0 newspaper 497.00000 10.78000 1 magazine 233.00000 5.05000 2 digital_media 527.00000 11.43000 3 educational_channels 705.00000 15.29000 4 referral 93.00000 2.02000
Educational channels generate the most leads, followed by digital media and newspaper; referrals generate the fewest.
Outlier Check¶
# outlier detection using boxplot
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_cols):
    plt.subplot(2, 2, i + 1)  # 2x2 grid for the four numeric columns
    plt.boxplot(data_clean[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Data preparation for modeling.¶
data_clean.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4612 entries, 0 to 4611 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 4612 non-null float64 1 current_occupation 4612 non-null object 2 first_interaction 4612 non-null object 3 profile_completed 4612 non-null int64 4 website_visits 4612 non-null float64 5 time_spent_on_website 4612 non-null float64 6 page_views_per_visit 4612 non-null float64 7 last_activity 4612 non-null object 8 newspaper 4612 non-null object 9 magazine 4612 non-null object 10 digital_media 4612 non-null object 11 educational_channels 4612 non-null object 12 referral 4612 non-null object 13 status 4612 non-null int64 14 engagement_score 4612 non-null float64 dtypes: float64(5), int64(2), object(8) memory usage: 540.6+ KB
data_clean.columns
Index(['age', 'current_occupation', 'first_interaction', 'profile_completed',
'website_visits', 'time_spent_on_website', 'page_views_per_visit',
'last_activity', 'newspaper', 'magazine', 'digital_media',
'educational_channels', 'referral', 'status', 'engagement_score'],
dtype='object')
# Numerical variables standardization
scaler = StandardScaler()
scaled_data = data_clean.copy()
numeric_cols.append('engagement_score')
scaled_data[numeric_cols] = scaler.fit_transform(scaled_data[numeric_cols])
# One-hot encoding to all categorical columns
data_encoded = pd.get_dummies(scaled_data, columns=categorical_cols, drop_first=True)
# Remove duplicates
data_encoded = data_encoded.loc[:, ~data_encoded.columns.duplicated()]
print(data_encoded.head())
age profile_completed website_visits time_spent_on_website \ 0 0.82057 3 1.48374 1.23024 1 0.74459 2 -0.60598 -0.86187 2 0.44064 2 -0.18804 -0.52976 3 0.51662 3 0.22991 -0.34960 4 -1.76301 3 0.22991 -0.16674 page_views_per_visit status engagement_score current_occupation_Student \ 0 -0.63593 1 1.23228 False 1 -1.56538 0 -0.86425 False 2 -1.71375 0 -0.53154 False 3 -0.51772 1 -0.34954 False 4 2.02573 0 -0.16445 True current_occupation_Unemployed first_interaction_Website \ 0 True True 1 False False 2 False True 3 True True 4 False True last_activity_Phone Activity last_activity_Website Activity \ 0 False True 1 False True 2 False True 3 False True 4 False False newspaper_Yes magazine_Yes digital_media_Yes educational_channels_Yes \ 0 True False True False 1 False False False True 2 False False True False 3 False False False False 4 False False False False referral_Yes 0 False 1 False 2 False 3 False 4 False
X = data_encoded.drop('status', axis=1)
Y = data_encoded["status"]  # Define the dependent variable (target)
print(X.columns)
# Splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
Index(['age', 'profile_completed', 'website_visits', 'time_spent_on_website',
'page_views_per_visit', 'engagement_score',
'current_occupation_Student', 'current_occupation_Unemployed',
'first_interaction_Website', 'last_activity_Phone Activity',
'last_activity_Website Activity', 'newspaper_Yes', 'magazine_Yes',
'digital_media_Yes', 'educational_channels_Yes', 'referral_Yes'],
dtype='object')
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (3228, 16) Shape of test set : (1384, 16) Percentage of classes in training set: status 0 0.70415 1 0.29585 Name: proportion, dtype: float64 Percentage of classes in test set: status 0 0.69509 1 0.30491 Name: proportion, dtype: float64
Percentage of classes in training set:
Meaning: In the training set, 70.4% of the examples have status = 0 (not converted) and 29.6% have status = 1 (converted).
Interpretation: There is a moderate imbalance in the classes, but it's not extreme. The model will see more examples of unconverted leads than converted ones.
Percentage of classes in test set:
Meaning: On the test set, 69.5% of the examples have status = 0 and 30.5% have status = 1.
Interpretation: The proportion of classes on the test set is very similar to that on the training set, which is good for evaluating the model fairly.
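Here the train/test proportions happen to match closely, but `train_test_split` can guarantee it via its `stratify` argument. A minimal sketch, using synthetic data in place of the notebook's `X` and `Y`:

```python
# Stratified split: class proportions in train and test are forced to match
# the full dataset.  Synthetic stand-ins keep this sketch self-contained.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
Y_demo = pd.Series(rng.choice([0, 1], size=1000, p=[0.7, 0.3]), name="status")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.30, random_state=1, stratify=Y_demo
)
# With stratify=, the class ratios agree to within rounding
print(y_tr.value_counts(normalize=True).round(3))
print(y_te.value_counts(normalize=True).round(3))
```

Stratifying is cheap insurance: with an unstratified split, an unlucky shuffle can leave the test set with a noticeably different class balance.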
Building a Decision Tree model¶
# Import libraries
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
# Create the model
dt_model = DecisionTreeClassifier(random_state=1)
# Train the model
dt_model.fit(X_train, y_train)
# Predict on test set
y_pred = dt_model.predict(X_test)
print(len(y_pred))
# Accuracy
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
# Get confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Graph CM
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Decision Tree Confusion Matrix')
plt.show()
# Confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
1384
Accuracy on test set: 0.8078034682080925

Confusion Matrix:
 [[830 132]
 [134 288]]
The Decision Tree model achieved an accuracy of 0.807 on the test set, indicating good overall performance in distinguishing between converted and unconverted leads.
The confusion matrix shows that the model correctly classified 830 unconverted leads and 288 converted leads. However, 134 leads that did convert were misclassified as unconverted (false negatives), and 132 unconverted leads were misclassified as converted (false positives). The model performs reasonably well, but could be improved to reduce false negatives if the business prioritizes not missing out on conversion opportunities.
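If missing a converter (a false negative) costs more than a wasted follow-up call, one option is to lower the 0.5 decision threshold using the model's predicted probabilities. A sketch on synthetic data standing in for the notebook's split:

```python
# Lowering the decision threshold trades false negatives for false positives.
# make_classification stands in for the notebook's leads data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Recall on the converted class can only rise as the threshold drops
for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: recall={recall_score(y_te, preds):.3f}")
```

The threshold value (0.3 here) is illustrative; in practice it would be chosen from the business's relative cost of false negatives vs. false positives.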
print("\nClassification Report:\n", classification_report(y_test, y_pred))
plt.figure(figsize=(20,10))
tree.plot_tree(dt_model, feature_names=X.columns, class_names=['Not Converted', 'Converted'], filled=True, max_depth=3)
plt.title("Decision Tree (first 3 levels)")
plt.show()
Classification Report:
precision recall f1-score support
0 0.86 0.86 0.86 962
1 0.69 0.68 0.68 422
accuracy 0.81 1384
macro avg 0.77 0.77 0.77 1384
weighted avg 0.81 0.81 0.81 1384
The Decision Tree model shows an overall accuracy of 81%. Class 0 (non-converted) has stronger metrics, with precision and recall both at 0.86, than Class 1 (converted), where both sit around 0.68. This indicates that the model is more effective at identifying leads that do not convert, and has room for improvement at identifying those that do.
Do we need to prune the tree?¶
Pruning the decision tree (using parameters like max_depth or min_samples_leaf) helps prevent overfitting and improves the model’s ability to generalize to new data. It is a recommended best practice, especially if the unpruned tree shows signs of overfitting.
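Besides capping `max_depth` or `min_samples_leaf`, scikit-learn also supports minimal cost-complexity pruning through the `ccp_alpha` parameter. A minimal sketch on synthetic data (the `ccp_alpha=0.005` value is illustrative, not tuned):

```python
# Cost-complexity pruning: grow a full tree, then remove subtrees whose
# impurity reduction is not worth their added complexity (ccp_alpha).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.005).fit(X_tr, y_tr)

print("unpruned leaves:", full.get_n_leaves())
print("pruned leaves:  ", pruned.get_n_leaves())
```

In practice `ccp_alpha` can be tuned with `GridSearchCV` just like `max_depth`; `cost_complexity_pruning_path` gives candidate alpha values.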
# Define the range of depths to be tested
param_grid = {'max_depth': range(2, 21)} # Depth 2 - 20
# Configure GridSearchCV
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, scoring='accuracy')
# Fit the model to the training data
grid_search.fit(X_train, y_train)
# Best depth and its mean cross-validated accuracy
best_depth = grid_search.best_params_['max_depth']
best_score = grid_search.best_score_
print(f"Best tree depth: {best_depth}")
print(f"Average accuracy (cross-validation): {best_score:.4f}")
# Graphing accuracy vs. depth
results = grid_search.cv_results_
plt.figure(figsize=(8,5))
sns.lineplot(x=list(param_grid['max_depth']), y=results['mean_test_score'], marker='o')
plt.title('Accuracy vs. Tree Depth (Cross Validation)')
plt.xlabel('Max Depth')
plt.ylabel('Average accuracy (CV)')
plt.axvline(x=best_depth, color='red', linestyle=':', linewidth=2, label=f'Best Depth = {best_depth}')
plt.legend()
plt.grid(True)
plt.show()
Best tree depth: 5
Average accuracy (cross-validation): 0.8569
The best depth for the decision tree is 5, and at that depth the model achieves an average accuracy of 85.69% on the training data.
# Train the tree with the best depth
dt_best = DecisionTreeClassifier(max_depth=best_depth, random_state=1)
dt_best.fit(X_train, y_train)
# Evaluate model on test set
y_pred = dt_best.predict(X_test)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy on test set: 0.8547687861271677
precision recall f1-score support
0 0.89 0.91 0.90 962
1 0.78 0.73 0.75 422
accuracy 0.85 1384
macro avg 0.83 0.82 0.83 1384
weighted avg 0.85 0.85 0.85 1384
By pruning the decision tree to the optimal depth found via cross-validation, the model's accuracy on the test set increased from 80.8% to 85.5%. Furthermore, the F1 score for the converted-lead class improved from 0.68 to 0.75, and that class's precision and recall also increased considerably, indicating a better ability to correctly identify the leads most likely to convert. This demonstrates that pruning helps prevent overfitting and improves the model's generalization.
# Plotting tree with best depth
plt.figure(figsize=(20,10))
tree.plot_tree(dt_best, feature_names=X.columns, class_names=['Not Converted', 'Converted'], filled=True, max_depth=best_depth)
plt.title(f"Decision Tree (first {best_depth} levels)")
plt.show()
# Obtain the importance of each variable
feature_importances = pd.DataFrame({
'Variable': X.columns,
'Importance': dt_best.feature_importances_
}).sort_values(by='Importance', ascending=False)
# Visualize top 10 important variables
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Variable', data=feature_importances.head(10), palette='viridis')
plt.title(f'Top 10 most important features in the decision tree (max_depth={best_depth})', fontsize=14)
plt.xlabel('Importance')
plt.ylabel('Variable')
plt.tight_layout()
plt.show()
print(feature_importances)
                          Variable  Importance
3            time_spent_on_website     0.27123
8        first_interaction_Website     0.26656
1                profile_completed     0.20862
7    current_occupation_Unemployed     0.06739
6       current_occupation_Student     0.05784
9     last_activity_Phone Activity     0.05307
0                              age     0.03922
10  last_activity_Website Activity     0.01848
4             page_views_per_visit     0.01185
2                   website_visits     0.00573
5                 engagement_score     0.00000
11                   newspaper_Yes     0.00000
12                    magazine_Yes     0.00000
13               digital_media_Yes     0.00000
14        educational_channels_Yes     0.00000
15                    referral_Yes     0.00000
Focus on the top features: The Decision Tree model indicates that the most influential factors for lead conversion are the time spent on the website, whether the first interaction was through the website, and having a highly completed profile. These should be prioritized in marketing and lead nurturing strategies.
Business recommendation: Efforts to increase user engagement on the website and encourage users to complete their profiles may significantly improve conversion rates.
On low-importance features: Channels like newspaper, magazine, digital media, educational channels, and referrals did not show predictive power in this model, suggesting they may be less relevant for targeting high-conversion leads in this dataset.
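A quick sanity check for the zero-importance columns is to refit on only the features the pruned tree actually used and confirm accuracy is essentially unchanged. A sketch on synthetic data standing in for the notebook's split:

```python
# Refit on only the features the depth-5 tree split on; if the dropped
# columns carried no signal, test accuracy should be essentially unchanged.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=16, n_informative=5,
                           weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

tree5 = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)
used = tree5.feature_importances_ > 0     # mask of features the tree used
slim = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr[:, used], y_tr)

print("features kept:", used.sum(), "of", X.shape[1])
print("full acc:", accuracy_score(y_te, tree5.predict(X_te)))
print("slim acc:", accuracy_score(y_te, slim.predict(X_te[:, used])))
```

One caveat: zero importance in a depth-5 tree only means the feature never won a split in this tree; a different model (like the Random Forest below) may still use it.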
from sklearn.metrics import roc_curve, auc
# Get predicted probabilities for the positive class
y_proba = dt_best.predict_proba(X_test)[:, 1]
# Calculate ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
# Plot ROC Curve
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--')
plt.title('ROC Curve', fontsize=14)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"ROC-AUC Score: {roc_auc:.3f}")
ROC-AUC Score: 0.923
The AUC (Area Under the Curve) quantifies the overall ability of the model to discriminate between positive and negative classes.
- AUC = 0.92: Excellent performance, the model is very good at distinguishing between classes.
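With only ~30% positives, ROC-AUC can look flattering, so it is worth cross-checking with average precision (the area under the precision-recall curve), which focuses on the converted class. A minimal sketch on synthetic data:

```python
# Average precision complements ROC-AUC when the positive class is the
# minority, as with converted leads here.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

proba = DecisionTreeClassifier(max_depth=5, random_state=1) \
    .fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print(f"ROC-AUC:           {roc_auc_score(y_te, proba):.3f}")
print(f"Average precision: {average_precision_score(y_te, proba):.3f}")
```

The baseline for average precision is the positive-class rate (~0.30 here), not 0.5, so even a modest-looking score can represent a large lift over random.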
Building a Random Forest model¶
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)
# Predictions
y_pred = rf_model.predict(X_test)
# Model evaluation
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest Confusion Matrix')
plt.show()
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
Accuracy on test set: 0.8547687861271677
Classification Report:
precision recall f1-score support
0 0.87 0.93 0.90 962
1 0.80 0.69 0.74 422
accuracy 0.85 1384
macro avg 0.84 0.81 0.82 1384
weighted avg 0.85 0.85 0.85 1384
Confusion Matrix:
 [[890  72]
 [129 293]]
Compared with the unpruned decision tree, the Random Forest's confusion matrix shows more true negatives (890 vs. 830) and true positives (293 vs. 288), and fewer false positives (72 vs. 132) and false negatives (129 vs. 134). This means the model is better at predicting both converting and non-converting leads.
Accuracy increased from 80.8% to 85.5%.
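Tuning only `max_depth` reuses the decision-tree grid, but a Random Forest has other influential knobs, notably `max_features` and `min_samples_leaf`. A small illustrative grid (values chosen for speed, not tuned for this dataset), on synthetic stand-in data:

```python
# A fuller Random Forest grid: tune max_features and min_samples_leaf
# alongside max_depth.  The small search space keeps this sketch fast.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=1)

grid = {
    "max_depth": [5, 9],
    "max_features": ["sqrt", 0.5],   # features considered per split
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=1),
                      grid, cv=3, scoring="accuracy")
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

For larger grids, `RandomizedSearchCV` covers the same space at a fraction of the cost.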
# Configure GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy')
# Fit the model to the training data
grid_search.fit(X_train, y_train)
# Best depth and its mean cross-validated accuracy
best_depth = grid_search.best_params_['max_depth']
best_score = grid_search.best_score_
print(f"Best tree depth: {best_depth}")
print(f"Average accuracy (cross-validation): {best_score:.4f}")
# Graphing accuracy vs. depth
results = grid_search.cv_results_
plt.figure(figsize=(8,5))
sns.lineplot(x=list(param_grid['max_depth']), y=results['mean_test_score'], marker='o')
plt.title('Accuracy vs. Tree Depth (Cross Validation)')
plt.xlabel('Max Depth')
plt.ylabel('Average accuracy (CV)')
plt.axvline(x=best_depth, color='red', linestyle=':', linewidth=2, label=f'Best Depth = {best_depth}')
plt.legend()
plt.grid(True)
plt.show()
Best tree depth: 9
Average accuracy (cross-validation): 0.8600
Choosing max_depth=9 (as found by cross-validation) achieves the best balance between model complexity and generalization. This prevents overfitting and ensures the model performs well on new, unseen leads, with roughly 86% cross-validated accuracy.
# A Random Forest is an ensemble, so plot one of its constituent trees
plt.figure(figsize=(20,10))
tree.plot_tree(rf_model.estimators_[0], feature_names=X.columns, class_names=['Not Converted', 'Converted'], filled=True, max_depth=best_depth)
plt.title(f"First tree of the Random Forest (first {best_depth} levels)")
plt.show()
# Variable importance
importances = pd.DataFrame({
'Variable': X.columns,
'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Variable', data=importances.head(10), palette='viridis')
plt.title('Top 10 Most Important Features in Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
print(importances.head(10))
                          Variable  Importance
8        first_interaction_Website     0.17869
5                 engagement_score     0.17006
3            time_spent_on_website     0.15378
1                profile_completed     0.10686
4             page_views_per_visit     0.09772
0                              age     0.09667
2                   website_visits     0.05060
9     last_activity_Phone Activity     0.03313
7    current_occupation_Unemployed     0.03019
10  last_activity_Website Activity     0.01952
The Random Forest assigns substantial weight to our engineered feature, engagement_score, meaning it has a significant impact on predicting lead conversion.
Leads generated from the website are more likely to convert.
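Impurity-based `feature_importances_` can overstate continuous features such as engagement_score, so permutation importance on the held-out set is a useful cross-check. A sketch on synthetic stand-in data:

```python
# Permutation importance: the drop in held-out accuracy when one feature's
# values are shuffled -- a model-agnostic check on feature_importances_.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=1)

# Three most important features by mean accuracy drop
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")
```

If the two rankings disagree sharply for a feature, the permutation ranking is usually the safer guide for business conclusions.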
y_proba = rf_model.predict_proba(X_test)[:, 1]
# Calculate ROC curve and AUC. True Positive Rate vd False Positive Rate
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
# Plot ROC Curve
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--')
plt.title('ROC Curve', fontsize=14)
plt.xlabel('False positive rate', fontsize=12)
plt.ylabel('True positive rate', fontsize=12)
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
AUC = 0.92 means the model has good discriminatory power:
If you randomly pick one converted lead and one non-converted lead, the model will rank the converted lead higher 92% of the time.
The curve is close to the top-left corner, which indicates:
High TPR (Recall): The model correctly identifies most converters. Low FPR: Few non-converters are incorrectly classified as converters.
Do we need to prune the tree?¶
# Prune the Random Forest by setting max_depth and other pruning parameters
rf_pruned = RandomForestClassifier(
    n_estimators=100,
    max_depth=best_depth,    # Optimal depth found by cross-validation
    min_samples_leaf=1,      # Increase to prune away tiny leaves
    min_samples_split=2,     # Increase to require more samples per split
    random_state=1
)
print("Best depth:", best_depth)
rf_pruned.fit(X_train, y_train)
# Evaluate the pruned model
y_pred = rf_pruned.predict(X_test)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Best depth: 9
Accuracy on test set: 0.8547687861271677
Classification Report:
precision recall f1-score support
0 0.87 0.92 0.90 962
1 0.80 0.70 0.75 422
accuracy 0.85 1384
macro avg 0.84 0.81 0.82 1384
weighted avg 0.85 0.85 0.85 1384
Setting max_depth=9 is a form of pruning: you are limiting the maximum depth of each tree to 9, which prevents overfitting and improves generalization.
Actionable Insights and Recommendations¶
Identify which leads are more likely to convert to paid customers¶
The ML models (Decision Tree and Random Forest) achieved high accuracy (≈85% on the test set) in predicting lead conversion.
Key predictors
- Time spent on website
- First interaction via website
- Profile completion level
- Page views per visit
- Age
Action
Use the trained model to score new leads and prioritize those with high predicted conversion probability for sales and follow-up.
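A hedged sketch of how such lead scoring could work, ranking unseen leads by predicted conversion probability (synthetic data stands in for the notebook's feature matrix):

```python
# Lead-scoring sketch: rank new leads by predicted conversion probability
# so sales can work the list top-down.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1200, weights=[0.7, 0.3], random_state=1)
X_train, X_new = X[:1000], X[1000:]     # pretend the last 200 rows are new leads
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y[:1000])

scores = pd.DataFrame({
    "lead_id": np.arange(len(X_new)),
    "p_convert": model.predict_proba(X_new)[:, 1],
}).sort_values("p_convert", ascending=False)

print(scores.head())   # highest-propensity leads first
```

In production, new leads would need the same cleaning, encoding, and scaling steps applied before scoring, ideally wrapped in a single scikit-learn `Pipeline`.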
# Segment Leads by propensity (Hot, Warm, Cold)
df = data_clean.copy()
# Define segmentation logic
def segment_lead(row):
if row['profile_completed'] == 2 and row['website_visits'] >= 3 and row['referral'] == 'Yes':
return 'Hot'
elif row['website_visits'] >= 2:
return 'Warm'
else:
return 'Cold'
# Apply segmentation
df['Lead_Segment'] = df.apply(segment_lead, axis=1)
# Summary of segments
segment_summary = df.groupby('Lead_Segment').agg({
'website_visits': 'mean',
'time_spent_on_website': 'mean',
'Lead_Segment': 'count'
}).rename(columns={'Lead_Segment': 'Count'})
print("Lead Segmentation Summary:")
print(segment_summary)
Lead Segmentation Summary:
website_visits time_spent_on_website Count
Lead_Segment
Cold 0.81270 550.56297 929
Hot 5.83333 1229.72222 18
Warm 4.10668 765.49304 3665
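The Hot/Warm/Cold rules above are hand-crafted; the model's predicted probabilities give a data-driven alternative, for example cutting at quantiles. A sketch on synthetic stand-in data (the 50%/85% cut points are illustrative assumptions):

```python
# Model-driven segmentation: tier leads by predicted conversion probability
# instead of hand-written rules.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]

# Quantile cut-offs: bottom 50% Cold, next 35% Warm, top 15% Hot
cuts = np.quantile(proba, [0.50, 0.85])
segments = np.where(proba >= cuts[1], "Hot",
                    np.where(proba >= cuts[0], "Warm", "Cold"))

summary = pd.DataFrame({"p_convert": proba, "segment": segments}) \
    .groupby("segment")["p_convert"].agg(["count", "mean"])
print(summary)
```

The advantage over fixed rules is that the tiers automatically incorporate every feature the model uses, and the cut points can be set to match sales capacity.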
Most influential factors for conversion
time_spent_on_website (25.8%) is the top predictor: This means the more time a lead spends on the website, the higher the likelihood of conversion.
first_interaction_Website (18.2%) is also highly important: Leads whose first interaction was through the website are more likely to convert.
page_views_per_visit (12.1%) and age (11.3%) are also strong predictors: More page views per visit and certain age groups are associated with higher conversion rates.
Profile completion matters: profile_completed_3 (6.3%) and profile_completed_2 (4.4%) indicate that leads with more complete profiles are more likely to convert.
Other relevant factors: website_visits (6.2%): More visits to the website increase the chance of conversion.
last_activity: The type of last activity also plays a role, with email activity the most predictive.
current_occupation_Unemployed (3.1%): This occupation group has some predictive power, but less than the top web interaction features.
Business implications: Focus on digital engagement: The most important features are related to user engagement on the website. This suggests that strategies to increase time spent, page views, and profile completion could significantly improve conversion rates.
Less importance for other channels: Features not listed here (such as magazine, newspaper, digital media, referrals) have negligible or zero importance, indicating they are not strong predictors of conversion in this dataset.
How to use these insights:
- For marketing: Invest in improving the website experience and encourage users to complete their profiles.
- For sales: Prioritize leads who spend more time on the website, have higher page views per visit, and whose first interaction was online.
- For product: Consider features that make it easier for users to explore more pages and complete their profiles.
Objective 2: Find the factors driving the lead conversion process¶
- Website engagement is critical:
Leads who spend more time and visit more pages on the website are much more likely to convert.
- Profile completion matters:
Leads with highly completed profiles (75–100%) have a significantly higher conversion rate.
- First interaction channel:
Leads whose first interaction is via the website convert at a much higher rate than those via the mobile app.
- Last activity:
Email is the channel through which customers interact and re-engage with the platform the most; keep customers engaged via this channel.
- Referral channel:
Although referrals generate fewer leads, their conversion rate is the highest (67.7%).
- Action:
Focus marketing efforts on increasing website engagement and encouraging profile completion. Improve the mobile app experience to boost conversion and increase visit time. Incentivize referrals, as they have the highest conversion rate.
Example Mobile App improvement:
- Earning rewards to promote engagement through the app.
Example: Referrals Campaign
- Website interaction is key. Prioritize web interactions, add chatbots, capture visitors with promotions, subscription links, special offers, pop-ups, etc.
Objective 3: Create a profile of leads likely to convert¶
Profile of high-converting leads:
Age: Typically older (mean age ~50 for professionals and unemployed; students are younger but convert less).
Occupation: Professionals convert most, followed by unemployed; students convert least.
First interaction: Website.
High profile completion (75–100%).
High website activity (more visits, more time spent, more pages viewed).
Last activity: Website or phone.
Referral source: Highest conversion rate.
Action: Target professionals and unemployed individuals with tailored messaging. Encourage all leads to complete their profiles and interact more with the website. Use the model to segment and prioritize leads for sales outreach.
- Woman, professional, 40-50 years old, upskilling from her phone during commute.
- Education for all - campaign example.
- Example campaign for Profile Completion
Business Recommendations
Prioritize leads with high website engagement and profile completion for sales follow-up.
Enhance the website and mobile app experience to increase time spent and page views.
Develop campaigns to encourage profile completion (e.g., progress bars, incentives).
Leverage referral programs, as they yield the highest conversion rates.
Use the ML model to automate lead scoring and resource allocation, focusing on leads most likely to convert.
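To automate scoring, the fitted model has to be persisted so a recurring job can reload it without retraining. A minimal sketch with joblib (the serializer recommended in scikit-learn's model-persistence docs), on synthetic stand-in data:

```python
# Persist the fitted model so a scoring job can reload it without retraining.
import os
import tempfile
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=1)
rf = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "lead_model.joblib")
joblib.dump(rf, path)

reloaded = joblib.load(path)
print("predictions identical:", (reloaded.predict(X) == rf.predict(X)).all())
```

Joblib files should only be loaded from trusted sources, and the scikit-learn version should match between training and scoring environments.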
🧠Modeling Insights
Algorithms Used: Decision Tree & Random Forest. Best Model: Pruned Random Forest (depth=9)
Accuracy: ~85.4% F1 Score (Class 1 - Converted): Improved from 0.68 to 0.75
Top Predictive Features:
Time spent on website (25.8%)
First interaction via Website
Profile completion level
Last activity type (Website > Email > Phone)
Referral source
Conclusion:
By connecting the model's results to the business objectives, ExtraaLearn can allocate resources more efficiently, improve conversion rates, and grow its customer base by focusing on the most promising leads and the factors that drive their conversion.